Coronavirus disease (COVID-19) is an infectious disease caused by a newly discovered coronavirus. It has spread to numerous countries across all continents since it was first detected in Wuhan, China in November 2019, and was declared a pandemic by the WHO on 11 March 2020.
Various countries have introduced measures and restrictions in response to COVID-19. Since the “circuit breaker”, a partial nationwide lockdown in which only essential services were allowed to remain open, Singapore (SG) residents have felt a great impact on daily life: they are encouraged to stay home as much as possible, and wearing a mask became mandatory when going out. The SG government has been constantly revising policies and social restrictions. Three phases of planned reopening were announced from 19 May: “Safe Reopening” (Phase 1), “Safer Transition” (Phase 2), and finally “Safe Nation” (Phase 3).
Microblogging has become one of the most useful tools for sharing everyday life events and news and for expressing opinions about these events. As Twitter posts are short and constantly being generated, they are a great source for providing public sentiment towards events that occurred throughout the COVID-19 period in Singapore.
In our Capstone Project, we perform exploratory data analysis of the Singapore COVID-19 situation, together with sentiment analysis and modeling of COVID-19 tweets, seeking to answer the following research questions:
What are the prevailing sentiments and emotions expressed in Singapore tweets about the current COVID-19 situation?
Does sentiment change over time amid global reopening and rising vaccination rates, in contrast to the growing number of new daily cases and deaths locally?
For our data science project, we loaded the following packages, using the Tidyverse approach.
# Load necessary packages
pacman::p_load(tidyverse, broom, modelr, lubridate,
tidytext, wordcloud2, wordcloud, reshape2,
textdata, huxtable, # Employing Lexicon
gridExtra, psych, Hmisc, sandwich,
scales, rtweet, glue, ggplot2,
caret, DT, dplyr)
my_colors <- c("#05A4C0", "#85CEDA", "#D2A7D8", "#A67BC5", "#BB1C8B", "#8D266E", "gold4", "darkred", "deepskyblue4")
my_theme <- theme(plot.background = element_rect(fill = "grey98", color = "grey20"),
panel.background = element_rect(fill = "grey98"),
panel.grid.major = element_line(colour = "grey87"),
text = element_text(color = "grey20"),
plot.title = element_text(size = 22),
plot.subtitle = element_text(size = 17),
axis.title = element_text(size = 15),
axis.text = element_text(size = 15),
legend.box.background = element_rect(color = "grey20", fill = "grey98", size = 0.1),
legend.box.margin = margin(t = 3, r = 3, b = 3, l = 3),
legend.title = element_blank(),
legend.text = element_text(size = 15),
strip.text = element_text(size=17))

Then, we imported our dataset.
Data Source 1: SG COVID DATA
SSA <- readRDS("covid19.rds")

The dataset contains time-series data on COVID-19 cases in Singapore: counts of confirmed, discharged, hospitalised, death and imported cases.
Within the dataset, the key fields reflecting the current Singapore COVID-19 situation are Daily.Confirmed, Daily.Deaths, Still.Hospitalised and Intensive.Care.Unit..ICU., which will also serve as our predictor (x) variables.
Notes:
All figures (excluding variables with names containing “MOH report”) are as at press release within the day and are not back-dated to update any changes that the Ministry of Health (MOH) might have made.
“Daily imported” and “Daily Local transmission” make up “Daily Confirmed”.
“Still Hospitalised” is computed as “Total Confirmed” - “Cumulative Discharged” - “Discharged to Isolation” - “Cumulative Deaths” - “Tested positive demise”. This might not tally with the sum of “General Wards MOH report” and “Intensive Care Unit (ICU)”, indicating dirty data.
“Cumulative Vaccine Doses”, “Cumulative Individuals Vaccinated”, and “Cumulative Individuals Vaccination Completed” added from 1 Jul 2021. The data is tied to the date of the report but the count is as of the previous day. i.e. Figures indicated for 1 Jul 2021 reflect the total number of doses of COVID-19 vaccines as of Jun 30. “Cumulative Individuals Vaccinated” refers to the number of individuals who have received at least one dose of vaccine.
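The “Still Hospitalised” identity in the notes above can be sanity-checked once the data is loaded. A minimal sketch, assuming the column names shown in the glimpse output (Cumulative.Confirmed, Cumulative.Discharged, Discharged.to.Isolation, Cumulative.Deaths, Tested.positive.demise):

```r
# Sketch: recompute Still.Hospitalised from the stated identity and count
# the rows where the recomputed value disagrees with the recorded one --
# non-zero mismatches would confirm the "dirty data" caveat above.
library(dplyr)

SSA %>%
  mutate(computed_hospitalised = Cumulative.Confirmed - Cumulative.Discharged -
           Discharged.to.Isolation - Cumulative.Deaths - Tested.positive.demise) %>%
  summarise(mismatches = sum(computed_hospitalised != Still.Hospitalised,
                             na.rm = TRUE))
```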
glimpse(SSA)

## Rows: 999
## Columns: 37
## $ Date <chr> "2020-01…
## $ Daily.Confirmed <int> 1, 2, 1,…
## $ False.Positives.Found <int> NA, NA, …
## $ Cumulative.Confirmed <int> 1, 3, 4,…
## $ Daily.Discharged <int> 0, 0, 0,…
## $ Passed.but.not.due.to.COVID <int> 0, 0, 0,…
## $ Cumulative.Discharged <int> 0, 0, 0,…
## $ Discharged.to.Isolation <int> 0, 0, 0,…
## $ Still.Hospitalised <int> 1, 3, 4,…
## $ Daily.Deaths <int> 0, 0, 0,…
## $ Cumulative.Deaths <int> 0, 0, 0,…
## $ Tested.positive.demise <int> 0, 0, 0,…
## $ Daily.Imported <int> 1, 2, 1,…
## $ Daily.Local.transmission <int> 0, 0, 0,…
## $ Local.cases.residing.in.dorms.MOH.report <int> NA, NA, …
## $ Local.cases.not.residing.in.doms.MOH.report <int> NA, NA, …
## $ Intensive.Care.Unit..ICU. <int> 0, 0, 0,…
## $ General.Wards.MOH.report <int> NA, NA, …
## $ In.Isolation.MOH.report <int> NA, NA, …
## $ Total.Completed.Isolation.MOH.report <int> NA, NA, …
## $ Total.Hospital.Discharged.MOH.report <int> NA, NA, …
## $ Requires.Oxygen.Supplementation.or.Unstable <int> NA, NA, …
## $ Linked.community.cases <int> NA, NA, …
## $ Unlinked.community.cases <int> NA, NA, …
## $ Phase <chr> "", "", …
## $ Cumulative.Vaccine.Doses <int> NA, NA, …
## $ Cumulative.Individuals.Vaccinated <int> NA, NA, …
## $ Cumulative.Individuals.Vaccination.Completed <int> NA, NA, …
## $ Perc.population.completed.at.least.one.dose <chr> "", "", …
## $ Perc.population.completed.vaccination <chr> "", "", …
## $ Sinovac.vaccine.doses <int> NA, NA, …
## $ Cumulative.individuals.using.Sinovac.vaccine <int> NA, NA, …
## $ Doses.of.other.vaccines.recognised.by.WHO <int> NA, NA, …
## $ Cumulative.individuals.using.other.vaccines.recognised.by.WHO <int> NA, NA, …
## $ Number.taken.booster.shots <int> NA, NA, …
## $ Perc.population.taken.booster.shots <chr> "", "", …
## $ X <lgl> NA, NA, …
Data Source 1: Tidy & Transform
The first thing we did with our loaded dataset was to remove the irrelevant columns, keeping only those we will use for analysis and modeling. We also transformed the dataset into long format for exploratory data visualization.
SSA1 <- tibble(SSA)
# Position-based subsetting: keep the study window and only Date, Daily.Confirmed,
# Still.Hospitalised, Daily.Deaths and Intensive.Care.Unit..ICU.
SSA1 <- SSA1[-(1:626), -(18:37)]
SSA1 <- SSA1[, -(11:16)]
SSA1 <- SSA1[, -(3:8)]
SSA1 <- SSA1[-(35:373), ]
SSA1$Date <- as.Date(SSA1$Date)
# This illustrates a comparison of daily Death, Confirmed, Hospitalised and ICU cases over the study period
SSA_chart <- SSA1 %>% pivot_longer(cols = Daily.Confirmed:Intensive.Care.Unit..ICU. ,
names_to = "Cases",
values_to = "Value")
COLORS <- c(Daily.Confirmed = "#c381fd", Daily.Deaths = "#4815aa",
Still.Hospitalised = "#f2626b" , Intensive.Care.Unit..ICU. = "#feba4f")
ggplot(SSA_chart, aes(x = Date, y = Value, group = Cases, color = Cases)) +
geom_line(size = 0.9) +
scale_color_manual(values = COLORS)+
scale_y_continuous("Cases", limits = c(0,5500)) +
labs(title="Comparison of Daily Cases\nfor Death, Confirmed, Hospitalised and ICU")+
theme(legend.title = element_text(color = "blue", size = 10)) +
my_theme + theme(axis.title.x = element_blank(),
legend.position = "bottom") +
scale_x_date(date_breaks = "1 day") +
ggthemes::theme_fivethirtyeight() +
theme(axis.text.x = element_text(angle = 45, size = rel(0.6), vjust = 1, hjust = 1))

Data Source 2: SG TWITTER DATA
# We observed that a 7-day extraction is usually capped below 3000 tweets.
# sg_tweets <- search_tweets(q = "#coronavirus OR #covid19 OR #COVID OR #stayhome OR #Covid-19 OR #pandemic OR #virus OR #social-distance OR #self-quarantine OR #swab-test OR #PCR OR #infection-rate",
#                            n = 3000,
#                            lang = "en",
#                            include_rts = F,
#                            geocode = lookup_coords("singapore"))
sg_tweets <- readRDS("tweeter_data.rds")

Let’s explore our tweet data regarding COVID-19, from our first extraction on 18 October, to understand sentiment after the recent sharp rise in local cases and deaths since end-September.
We also identified two key events over the period to analyse further, to answer our research question of whether each event had a significant influence on public sentiment:
2021-10-20
2021-11-08
It is worth noting that whenever there is a major announcement by the government, there is a jump in the number of tweets.
# Basic EDA on amount of tweet in time (ALL)
options(repr.plot.width=20, repr.plot.height=9)
sg_tweets %>%
select(created_at) %>%
mutate(date = ymd(as.Date(created_at))) %>%
group_by(date) %>%
summarise(n = n(), .groups = "drop_last") %>%
ggplot(aes(x=date, y = n)) +
geom_line(size = 1, color = my_colors[1]) +
coord_cartesian(clip = 'off') +
geom_vline(xintercept = as.Date('2021-10-20'), linetype="dotted", size = 1.5, color = "red") +
geom_vline(xintercept = as.Date('2021-11-08'), linetype="dotted", size = 1.5, color = "darkblue") +
my_theme + theme(axis.title.x = element_blank()) +
scale_x_date(date_breaks = "1 day") +
ggthemes::theme_fivethirtyeight() +
theme(axis.text.x = element_text(angle = 45, size = rel(0.6), vjust = 1, hjust = 1 )) +
labs(title = "Number of COVID-19 Tweets shared between 10th Oct - 15th Nov", subtitle = "Number of tweets spiked on key events") +
geom_label(aes(x=as.Date('2021-10-19'), y = 380, label = "PM Lee's address on COVID-19"), color = "red", size = 4, angle = 90, fontface = "bold") +
geom_label(aes(x=as.Date('2021-11-07'), y = 380, label = "More Opening on COVID-19 restrictions" ), color = "darkblue", size = 4, angle = 90, fontface = "bold")

Data Source 2: Tidy & Transform
# Step 1: Tokenization ----
sg_tweets_id <- sg_tweets %>%
mutate(created_at = as.Date(created_at)) %>%
rowid_to_column("id")
tidy_tweets_token <- sg_tweets_id %>%
drop_na(text) %>%
select(id, created_at, status_id, text) %>%
filter(text != "") %>%
unnest_tokens(word, text, token = "tweets")
# Step 2: Cleaning ----
tweets_cleaned <- tidy_tweets_token %>%
anti_join(tidytext::stop_words)
# Manual cleaning, filtering out unwanted words
tweets_token_cleaned <- tweets_cleaned %>%
filter(!word %in% c("singapore", "covid", "covid19", "positive", "negative",
                    "oct", "nov", "news", "amp", "reuters", "daily", "malaysia",
                    "november", "october", "press", "journal", "amid", "weekly",
                    "days", "weeks", "china", "chinese", "report", "am", "pm",
                    "dont", "taking", "found", "morning", "bloomberg", "months",
                    "month", "india", "media", "week", "read", "reports", "data",
                    "europe", "monday", "tuesday", "wednesday", "thursday",
                    "friday", "saturday", "sunday", "wall", "street") &
         !str_detect(word, "^#|^@") &       # drop hashtags and handles
         !str_detect(word, "^[:digit:]+$")) # drop purely numeric tokens

covid_tweets_for_wc <- tweets_token_cleaned %>%
group_by(word) %>%
summarise(frequency = n()) %>%
arrange(desc(frequency))
covid_tweets_for_wc %>%
filter(frequency > 3) %>%
wordcloud2(backgroundColor = "black",
color = "random-light")

# A Positive-Negative Word Cloud using BING
BING <- get_sentiments("bing")
tweets_token_cleaned %>%
inner_join(BING, by="word") %>%
count(word, sentiment, sort=T) %>%
acast(word ~ sentiment, value.var = "n", fill=0) %>%
comparison.cloud(colors=my_colors[c(5, 1)], max.words = 400, title.size = 2,
scale = c(3,.5))

AFINN <- get_sentiments("afinn")
## TOP 3 MOST NEGATIVE TWEET ----
tweets_AFINN_indexed <- tweets_token_cleaned %>%
inner_join(AFINN, by = "word")
tweet_level_sentiment <- tweets_AFINN_indexed %>%
group_by(id) %>%
summarise(average_sentiment = mean(value),
n_of_words_indexed = n()
)
top3_negative <- tweet_level_sentiment %>%
arrange(average_sentiment) %>%
head(3)
sg_tweets_id %>%
filter(id %in% top3_negative$id ) %>%
select(text) %>%
pull()

## [1] "'They don't choose to have it': Why Covid-19 is hell for people with OCD"
## [2] "Asked my mum to get me donuts 🍩 I’ve no idea why i did that when I can’t taste no shit. Arghh but just cravings 😫 fuck covid 19, fuck the monthly cycle, fuck everything! 😖"
## [3] "deadass my eyes went from covid 19 then positive bitch"
# TOP 3 MOST POSITIVE TWEETS ----
top3_positive <- tweet_level_sentiment %>%
arrange(desc(average_sentiment)) %>%
head(3)
sg_tweets_id %>%
filter(id %in% top3_positive$id) %>%
select(text) %>%
pull()

## [1] "…or else they find alternative employment. They are also bringing in a vax passport to enter some venues buildings etc, unless there is a current Covid-19 test (often valid only for a day or 2). It's basically becoming a no vax, no job, & no fun situation. This I approve."
## [2] "“MOH said the doctor had no known medical conditions and was partially vaccinated with a non-mRNA Covid-19 vaccine under the Special Access Route.”\n\nWow, this has to be stated so explicitly. I wonder why? \n\n @Huigoon"
## [3] "@DavidBieleski @bergeron_brent @katiehasedits This trend was evident on 9Aug 2020, Back then out of the 3, SG had the most cases with 55,104, whilst NZ, the least with 1,219. Even back then Greece the highest with 213 COVID-19 related deaths.\n\nFast forward to 5 Nov 2021, Greece is the winner in Covid-19 cases & deaths."
Distribution Breakdown by Emotion Class using NRC technique
NRC <- get_sentiments("nrc")
options(repr.plot.width=15, repr.plot.height=9)
tweets_token_cleaned %>%
inner_join(NRC, "word") %>%
filter(!sentiment %in% c("positive", "negative")) %>%
count(sentiment, sort=T) %>%
ggplot(aes(x=reorder(sentiment, n), y=n)) +
geom_bar(stat="identity", aes(fill=n), show.legend=F) +
geom_label(aes(label=format(n, big.mark = ",")), size=5, fill="white") +
labs(x="Sentiment", y="Frequency", title="What is the overall mood in Tweets?") +
scale_fill_gradient(low = my_colors[3], high = my_colors[1], guide="none") +
coord_flip() +
my_theme + theme(axis.text.x = element_blank())

Most Frequent Words by Emotion Class
#options(repr.plot.width=25, repr.plot.height=9)
tweets_token_cleaned %>%
inner_join(NRC, "word") %>%
count(sentiment, word, sort=T) %>%
group_by(sentiment) %>%
arrange(desc(n)) %>%
slice(1:7) %>%
ggplot(aes(x=reorder(word, n), y=n)) +
geom_col(aes(fill=sentiment), show.legend = F) +
facet_wrap(~sentiment, scales = "free_y", nrow = 2, ncol = 5) +
coord_flip() +
my_theme + theme(axis.text.x = element_blank()) +
labs(x="Word", y="Frequency", title="Sentiment split by most frequent words") +
scale_fill_manual(values = c(my_colors, "#BE82AF", "#9D4387", "#DEC0D7",
"#40BDC8", "#80D3DB", "#BFE9ED"))

20 Oct 2021
Here we turn to our research question: we have been taking note of key events announced by the government, and we want to know whether the leadership’s address on 20 October changed public sentiment.
We use Regression Discontinuity (RD) analysis to draw causal inference about its effect.
First, we explore the data for the 10 days before and after PM Lee’s address, assuming dates close to the 20 October cutoff are the most relevant.
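In RD terms, the specification fitted below can be sketched as follows (notation here is illustrative: c is the 20 Oct 2021 cutoff and After is an indicator for dates on or after it, matching the covid_address_effect dummy and the I(Date - c) running variable in the lm call):

\[ \begin{eqnarray} average.sentiment_t = \beta_0 + \beta_1 After_t + \beta_2 (Date_t - c) + \epsilon_t \end{eqnarray} \]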
Using AFINN technique
sentiment_daily <- tweets_AFINN_indexed %>%
group_by(created_at) %>%
summarise(average_sentiment = mean(value),
n_of_words_indexed = n())
# Plot
sentiment_daily %>%
filter(created_at >= as.Date('2021-10-10') & created_at <= as.Date('2021-10-30')) %>%
ggplot(aes(x = created_at, y = average_sentiment) ) +
geom_point(size = 2, color = my_colors[1]) +
geom_vline(xintercept = as.Date('2021-10-20'), size = 1, linetype="dotdash", color = my_colors[6]) +
scale_x_date(date_breaks = "1 day") +
ggtitle("Distribution of Average Sentiment \n10 days before & after PM address") +
ggthemes::theme_fivethirtyeight() +
theme(axis.text.x = element_text(angle = 45, size = rel(0.6), vjust = 1, hjust = 1 ))

Using NRC technique for Emotion Classification
# Extract for analysis period
tweets_token_analysis_period <- tweets_token_cleaned %>%
filter(created_at >= as.Date('2021-10-10') & created_at <= as.Date('2021-10-30'))
classified_sentiment <- tweets_token_analysis_period %>%
inner_join(NRC, "word") %>%
group_by(sentiment, created_at) %>%
summarise(cnt = n())
# Plot Chart
classified_sentiment %>%
filter(!sentiment %in% c("positive", "negative")) %>%
ggplot(aes(x=created_at, y =cnt, color = sentiment)) +
geom_point() +
facet_wrap(~sentiment, scales = "free_y", nrow = 2, ncol = 4) +
geom_vline(xintercept = as.Date('2021-10-20'), size = 1,linetype="dotdash", color = my_colors[8]) +
scale_x_date(breaks = c(as.Date('2021-10-10'), as.Date('2021-10-20'), as.Date('2021-10-30')), date_labels = "%b %d") +
theme(axis.text.x = element_text(angle = 45, size = rel(0.8), vjust = 1, hjust = 1 )) +
labs(y="Count of Emotional Words", x="Period of Date")

Using a Radar Chart, another visualisation option.
# Data transformation
char_sentiment <- classified_sentiment %>%
filter(!sentiment %in% c("positive", "negative")) %>%
mutate (covid_address_effect = as.factor(ifelse(created_at < '2021-10-20','Before','After'))) %>%
group_by(sentiment, covid_address_effect) %>%
summarise(char_sentiment_count = sum(cnt))
total_char <- classified_sentiment %>%
filter(!sentiment %in% c("positive", "negative")) %>%
mutate (covid_address_effect = as.factor(ifelse(created_at < '2021-10-20','Before','After'))) %>%
group_by(covid_address_effect) %>%
summarise(total = sum(cnt))
# Plot Chart
char_sentiment %>%
inner_join(total_char, by="covid_address_effect") %>%
mutate(percent = char_sentiment_count / total * 100 ) %>%
select(-char_sentiment_count, -total) %>%
spread(covid_address_effect, percent) %>%
radarchart::chartJSRadar(showToolTipLabel = T,
                         main = "Any Effects on the Emotion Class Percentage After Address? - No",
                         maxScale = 25, responsive = T, addDots = T,
                         colMatrix = grDevices::col2rgb(my_colors[c(2,4)]),
                         lineAlpha = 0.7, polyAlpha = 0.2)

OLS Linear Regression on average sentiment over the time period
merged_dataset_RDD <- SSA1 %>%
inner_join(sentiment_daily, by = c("Date" = "created_at")) %>%
filter(Date >= as.Date('2021-10-10') & Date <= as.Date('2021-10-30'))
# add dummy variable for pre-effect = 0, and post-effect = 1
merged_dataset_RDD <- merged_dataset_RDD %>%
mutate (covid_address_effect = as.factor(ifelse(Date < '2021-10-20','Before','After')))
merged_dataset_RDD %>%
lm(average_sentiment ~ covid_address_effect + I(Date - as.Date('2021-10-20')),.) %>%
summary()

##
## Call:
## lm(formula = average_sentiment ~ covid_address_effect + I(Date -
## as.Date("2021-10-20")), data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.36720 -0.09456 0.01933 0.08557 0.33828
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.336808 0.082404 -4.087 0.000691 ***
## covid_address_effectBefore -0.240103 0.153254 -1.567 0.134596
## I(Date - as.Date("2021-10-20")) -0.007816 0.012640 -0.618 0.544075
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1754 on 18 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1164
## F-statistic: 2.318 on 2 and 18 DF, p-value: 0.1271
merged_dataset_RDD %>%
ggplot(aes(x = Date, y = average_sentiment)) +
geom_point(aes(color = covid_address_effect)) +
geom_smooth(method = "lm") +
scale_x_date(breaks = c(as.Date('2021-10-10'), as.Date('2021-10-20'), as.Date('2021-10-30')), date_labels = "%b %d", date_minor_breaks = "1 day") +
ggthemes::theme_fivethirtyeight() +
ylab("Average Sentiment") +
theme(axis.title.y = element_text(), legend.position = "bottom") +
labs(title="OLS Simple Regression Model")

We now perform Regression Discontinuity analysis of the effect of the PM’s address.
We expect high tweet volume on 20 October, and a jump in sentiment scores/counts on 19 and 21 October.
# RDD
RDD <- merged_dataset_RDD %>%
ggplot(aes(x = Date, y = average_sentiment, color = covid_address_effect)) +
geom_point() +
geom_smooth(method = "lm") +
ggthemes::theme_fivethirtyeight() +
ggtitle("Regression Discontinuity Analysis") +
ylab("Average Sentiment") +
theme(axis.title.y = element_text()) +
scale_x_date(breaks = c(as.Date('2021-10-10'), as.Date('2021-10-20'), as.Date('2021-10-30')), date_labels = "%b %d", date_minor_breaks = "1 day") +
geom_vline(xintercept = as.Date('2021-10-20'), size = 1,linetype="dotdash", color = my_colors[8])
# Difference in Means Test
RDD_box <- merged_dataset_RDD %>%
ggplot(aes(x = Date, y = average_sentiment, color = covid_address_effect)) +
geom_boxplot(outlier.colour="black",
outlier.size=2, notch=FALSE) +
geom_point() +
ggthemes::theme_fivethirtyeight() +
ggtitle("Test for Significant Difference") +
scale_x_date(breaks = c(as.Date('2021-10-10'), as.Date('2021-10-20'), as.Date('2021-10-30')), date_labels = "%b %d", date_minor_breaks = "1 day") +
geom_vline(xintercept = as.Date('2021-10-20'), size = 1,linetype="dotdash", color = my_colors[8])
gridExtra::grid.arrange(RDD, RDD_box, ncol=2)

We then perform a t-test for a significant difference in means between the two groups (before and after the PM’s address).
# Conduct a difference of means test
# Hypothesis: H0 : mean of pre-address_effect = mean of post-address_effect
merged_dataset_RDD %>%
t.test(average_sentiment ~ covid_address_effect, .)

##
## Welch Two Sample t-test
##
## data: average_sentiment by covid_address_effect
## t = 2.1322, df = 17.963, p-value = 0.04705
## alternative hypothesis: true difference in means between group After and group Before is not equal to 0
## 95 percent confidence interval:
## 0.002298216 0.313764096
## sample estimates:
## mean in group After mean in group Before
## -0.3758900 -0.5339212
In preparation for the model, we computed a correlation matrix to see how our variables of interest are related.
pacman::p_load(Hmisc, broom, DT)
merged_dataset <- SSA1 %>%
inner_join(sentiment_daily, by = c("Date" = "created_at")) %>%
filter(Date >= as.Date('2021-10-10') & Date <= as.Date('2021-11-12')) %>%
mutate (covid_address_effect = as.factor(ifelse(Date < '2021-10-20','Before','After')))
merged_dataset$Date <- as.Date(merged_dataset$Date)
merged_dataset <- merged_dataset[ , -(7:8)]
#Getting Summary
merged_dataset %>%
select("average_sentiment", "Daily.Confirmed", "Daily.Deaths","Still.Hospitalised", "Intensive.Care.Unit..ICU.") %>%
summary(.)

## average_sentiment Daily.Confirmed Daily.Deaths Still.Hospitalised
## Min. :-0.7955 Min. :1767 Min. : 6.00 Min. :1434
## 1st Qu.:-0.5233 1st Qu.:2943 1st Qu.: 9.00 1st Qu.:1584
## Median :-0.4250 Median :3182 Median :12.00 Median :1640
## Mean :-0.4001 Mean :3206 Mean :12.03 Mean :1633
## 3rd Qu.:-0.2721 3rd Qu.:3472 3rd Qu.:14.75 3rd Qu.:1686
## Max. : 0.1036 Max. :5324 Max. :18.00 Max. :1757
## Intensive.Care.Unit..ICU.
## Min. :41.00
## 1st Qu.:58.25
## Median :64.00
## Mean :61.88
## 3rd Qu.:68.50
## Max. :75.00
merged_dataset %>%
select(average_sentiment, Daily.Confirmed, Daily.Deaths, Still.Hospitalised,Intensive.Care.Unit..ICU.) %>%
as.matrix(.) %>%
rcorr(.) %>%
tidy(.) %>%
rename(variable_1 = column1,
variable_2 = column2,
corr = estimate) %>%
mutate(abs_corr = abs(corr))

| variable_1 | variable_2 | corr | n | p.value | abs_corr |
|---|---|---|---|---|---|
| Daily.Confirmed | average_sentiment | -0.165 | 34 | 0.351 | 0.165 |
| Daily.Deaths | average_sentiment | 0.115 | 34 | 0.518 | 0.115 |
| Daily.Deaths | Daily.Confirmed | 0.0864 | 34 | 0.627 | 0.0864 |
| Still.Hospitalised | average_sentiment | 0.329 | 34 | 0.0573 | 0.329 |
| Still.Hospitalised | Daily.Confirmed | 0.152 | 34 | 0.39 | 0.152 |
| Still.Hospitalised | Daily.Deaths | 0.0675 | 34 | 0.704 | 0.0675 |
| Intensive.Care.Unit..ICU. | average_sentiment | 0.307 | 34 | 0.0773 | 0.307 |
| Intensive.Care.Unit..ICU. | Daily.Confirmed | 0.103 | 34 | 0.56 | 0.103 |
| Intensive.Care.Unit..ICU. | Daily.Deaths | 0.168 | 34 | 0.342 | 0.168 |
| Intensive.Care.Unit..ICU. | Still.Hospitalised | 0.415 | 34 | 0.0145 | 0.415 |
From the results, Intensive.Care.Unit..ICU. and Still.Hospitalised have the highest correlation (0.415).
We use the merged dataset to predict average sentiment from daily ICU, Confirmed, Hospitalised and Deaths counts. We randomly split the data into a training set (70%, for building the predictive model) and a test set (30%, for evaluating it), setting a seed for reproducibility.
# Split the data into training and test set
set.seed(1234)
training.samples <- merged_dataset$average_sentiment %>%
createDataPartition(p = 0.7, list = FALSE)
train.data <- merged_dataset[training.samples, ]
test.data <- merged_dataset[-training.samples, ]
#Note that, if you have many predictor variables in your data, you can simply include all the available variables in the model using ~.:
model <- lm(average_sentiment ~., data = train.data)
# Summarize the model
summary(model)

##
## Call:
## lm(formula = average_sentiment ~ ., data = train.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.24035 -0.13448 -0.01832 0.07180 0.43221
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -93.42312312 126.42061809 -0.739 0.469
## Date 0.00486328 0.00669085 0.727 0.476
## Daily.Confirmed -0.00007253 0.00008996 -0.806 0.430
## Still.Hospitalised 0.00060785 0.00058399 1.041 0.310
## Daily.Deaths -0.00254341 0.01248686 -0.204 0.841
## Intensive.Care.Unit..ICU. 0.00369593 0.00707699 0.522 0.607
##
## Residual standard error: 0.2008 on 20 degrees of freedom
## Multiple R-squared: 0.2217, Adjusted R-squared: 0.02712
## F-statistic: 1.139 on 5 and 20 DF, p-value: 0.3723
# Make predictions
predictions <- model %>% predict(test.data)
# (a) Prediction error, RMSE
RMSE(predictions, test.data$average_sentiment)

## [1] 0.2003003
# (b) R-square
R2(predictions, test.data$average_sentiment)

## [1] 0.2647791
We ran two regression models. The first regresses average sentiment on the daily case counts (model1).
\[ \begin{eqnarray} average.sentiment = intercept + b_1Daily.Deaths + b_2Daily.Confirmed + b_3Still.Hospitalised + b_4Intensive.Care.Unit..ICU. + \epsilon \end{eqnarray} \]
Our key investigation lies in the next model, in which we regress average sentiment on the daily case counts together with their interactions with ICU cases (model2).
\[ \begin{eqnarray} average.sentiment = intercept + b_1Daily.Deaths + b_2Daily.Confirmed + b_3Still.Hospitalised + b_4Intensive.Care.Unit..ICU. + b_5Daily.Deaths \times icu + b_6Daily.Confirmed \times icu + b_7Still.Hospitalised \times icu + \epsilon \end{eqnarray} \]
model1 <- lm(average_sentiment~ Daily.Deaths + Daily.Confirmed + Still.Hospitalised + Intensive.Care.Unit..ICU.,
train.data)
tidy(model1) %>% as_tibble()

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -1.54 | 0.922 | -1.67 | 0.11 |
| Daily.Deaths | 0.00132 | 0.0112 | 0.118 | 0.907 |
| Daily.Confirmed | -9.34e-05 | 8.43e-05 | -1.11 | 0.28 |
| Still.Hospitalised | 0.000577 | 0.000576 | 1 | 0.328 |
| Intensive.Care.Unit..ICU. | 0.00734 | 0.00493 | 1.49 | 0.151 |
glance(model1)

| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.201 | 0.049 | 0.199 | 1.32 | 0.294 | 4 | 7.91 | -3.83 | 3.72 | 0.828 | 21 | 26 |
model2 <- lm(average_sentiment ~ (Daily.Confirmed + Daily.Deaths +Still.Hospitalised) * Intensive.Care.Unit..ICU.,
train.data)
tidy(model2) %>% as_tibble()

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -8.41 | 7.57 | -1.11 | 0.281 |
| Daily.Confirmed | 0.00157 | 0.000625 | 2.52 | 0.0215 |
| Daily.Deaths | -0.3 | 0.115 | -2.62 | 0.0175 |
| Still.Hospitalised | 0.00384 | 0.00454 | 0.845 | 0.409 |
| Intensive.Care.Unit..ICU. | 0.1 | 0.121 | 0.831 | 0.417 |
| Daily.Confirmed:Intensive.Care.Unit..ICU. | -2.61e-05 | 9.78e-06 | -2.67 | 0.0155 |
| Daily.Deaths:Intensive.Care.Unit..ICU. | 0.0046 | 0.00174 | 2.64 | 0.0166 |
| Still.Hospitalised:Intensive.Care.Unit..ICU. | -4.13e-05 | 7.33e-05 | -0.563 | 0.581 |
glance(model2)

| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.495 | 0.299 | 0.171 | 2.52 | 0.0539 | 7 | 13.9 | -9.76 | 1.56 | 0.523 | 18 | 26 |
summary(model1)

##
## Call:
## lm(formula = average_sentiment ~ Daily.Deaths + Daily.Confirmed +
## Still.Hospitalised + Intensive.Care.Unit..ICU., data = train.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.24157 -0.12768 -0.01705 0.05730 0.46390
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.53616590 0.92152726 -1.667 0.110
## Daily.Deaths 0.00131739 0.01117317 0.118 0.907
## Daily.Confirmed -0.00009339 0.00008429 -1.108 0.280
## Still.Hospitalised 0.00057723 0.00057589 1.002 0.328
## Intensive.Care.Unit..ICU. 0.00734444 0.00493239 1.489 0.151
##
## Residual standard error: 0.1986 on 21 degrees of freedom
## Multiple R-squared: 0.2011, Adjusted R-squared: 0.04898
## F-statistic: 1.322 on 4 and 21 DF, p-value: 0.2944
summary(model2)

##
## Call:
## lm(formula = average_sentiment ~ (Daily.Confirmed + Daily.Deaths +
## Still.Hospitalised) * Intensive.Care.Unit..ICU., data = train.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.24710 -0.10603 -0.00932 0.09528 0.34621
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -8.410352619 7.566961238 -1.111
## Daily.Confirmed 0.001574364 0.000625087 2.519
## Daily.Deaths -0.300146413 0.114706230 -2.617
## Still.Hospitalised 0.003840899 0.004544835 0.845
## Intensive.Care.Unit..ICU. 0.100410137 0.120896846 0.831
## Daily.Confirmed:Intensive.Care.Unit..ICU. -0.000026150 0.000009783 -2.673
## Daily.Deaths:Intensive.Care.Unit..ICU. 0.004601271 0.001742310 2.641
## Still.Hospitalised:Intensive.Care.Unit..ICU. -0.000041252 0.000073300 -0.563
## Pr(>|t|)
## (Intercept) 0.2810
## Daily.Confirmed 0.0215 *
## Daily.Deaths 0.0175 *
## Still.Hospitalised 0.4091
## Intensive.Care.Unit..ICU. 0.4171
## Daily.Confirmed:Intensive.Care.Unit..ICU. 0.0155 *
## Daily.Deaths:Intensive.Care.Unit..ICU. 0.0166 *
## Still.Hospitalised:Intensive.Care.Unit..ICU. 0.5805
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1705 on 18 degrees of freedom
## Multiple R-squared: 0.4952, Adjusted R-squared: 0.2989
## F-statistic: 2.522 on 7 and 18 DF, p-value: 0.05385
We tested whether model2, with its interaction terms, improves the explanatory power over model1 using the anova() function.
anova(model1, model2)

| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 21 | 0.828 | ||||
| 18 | 0.523 | 3 | 0.305 | 3.49 | 0.0371 |
The results suggest that adding the interaction terms significantly improves model fit over model1 (F = 3.49, p = 0.037), with a corresponding increase in R-squared.
library(gvlma)
gvlma(model2)

##
## Call:
## lm(formula = average_sentiment ~ (Daily.Confirmed + Daily.Deaths +
## Still.Hospitalised) * Intensive.Care.Unit..ICU., data = train.data)
##
## Coefficients:
## (Intercept)
## -8.41035262
## Daily.Confirmed
## 0.00157436
## Daily.Deaths
## -0.30014641
## Still.Hospitalised
## 0.00384090
## Intensive.Care.Unit..ICU.
## 0.10041014
## Daily.Confirmed:Intensive.Care.Unit..ICU.
## -0.00002615
## Daily.Deaths:Intensive.Care.Unit..ICU.
## 0.00460127
## Still.Hospitalised:Intensive.Care.Unit..ICU.
## -0.00004125
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = model2)
##
## Value p-value Decision
## Global Stat 3.20136 0.5247 Assumptions acceptable.
## Skewness 0.70805 0.4001 Assumptions acceptable.
## Kurtosis 0.14447 0.7039 Assumptions acceptable.
## Link Function 2.32459 0.1273 Assumptions acceptable.
## Heteroscedasticity 0.02425 0.8763 Assumptions acceptable.
library(ggthemes)
theme_set(theme_fivethirtyeight())
library(ggfortify)
autoplot(gvlma(model2))

library(car)

## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
vif(model1); vif(model2)

## Daily.Deaths Daily.Confirmed Still.Hospitalised
## 1.075882 1.087332 1.124124
## Intensive.Care.Unit..ICU.
## 1.195660
## Daily.Confirmed
## 81.10537
## Daily.Deaths
## 153.80602
## Still.Hospitalised
## 94.96315
## Intensive.Care.Unit..ICU.
## 974.33724
## Daily.Confirmed:Intensive.Care.Unit..ICU.
## 164.60210
## Daily.Deaths:Intensive.Care.Unit..ICU.
## 196.10806
## Still.Hospitalised:Intensive.Care.Unit..ICU.
## 1215.39954
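The very large VIFs for model2 are typical of interaction models built from uncentered predictors: each product term is strongly correlated with its components. A hedged sketch of the usual remedy, mean-centering before forming the interaction, illustrated on the built-in mtcars data as a stand-in for train.data:

```r
library(car)

# Fit the same interaction twice: once on raw predictors, once after
# mean-centering them. Centering typically shrinks the largest VIF.
raw_fit <- lm(mpg ~ wt * hp, data = mtcars)

centered <- transform(mtcars, wt_c = wt - mean(wt), hp_c = hp - mean(hp))
cen_fit  <- lm(mpg ~ wt_c * hp_c, data = centered)

max(vif(raw_fit)); max(vif(cen_fit))   # the centered model's largest VIF is smaller
```

Centering changes the coefficients' interpretation (main effects become effects at the mean of the moderator) but not the model's fit.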
kable in R Markdown
library(knitr)      # Please install the package "knitr" first.
library(kableExtra) # You might want to use package "kableExtra" as well.
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:huxtable':
##
## add_footnote
## The following object is masked from 'package:dplyr':
##
## group_rows
kable(tidy(model2))%>%
kable_styling("striped", full_width = T, fixed_thead = T) %>%
column_spec(c(1, 5), bold = T) %>%
row_spec(c(2, 4, 6, 8), bold = T, color = "white", background = "#ff6347")
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -8.4103526 | 7.5669612 | -1.1114571 | 0.2809975 |
| Daily.Confirmed | 0.0015744 | 0.0006251 | 2.5186300 | 0.0214597 |
| Daily.Deaths | -0.3001464 | 0.1147062 | -2.6166531 | 0.0174761 |
| Still.Hospitalised | 0.0038409 | 0.0045448 | 0.8451129 | 0.4091412 |
| Intensive.Care.Unit..ICU. | 0.1004101 | 0.1208968 | 0.8305439 | 0.4171120 |
| Daily.Confirmed:Intensive.Care.Unit..ICU. | -0.0000261 | 0.0000098 | -2.6729591 | 0.0155165 |
| Daily.Deaths:Intensive.Care.Unit..ICU. | 0.0046013 | 0.0017423 | 2.6409030 | 0.0166049 |
| Still.Hospitalised:Intensive.Care.Unit..ICU. | -0.0000413 | 0.0000733 | -0.5627803 | 0.5805272 |
kable(glance(model2))%>%
kable_styling("striped", full_width = T, font_size = 12) %>%
column_spec(c(2, 4), bold = T, color = "white", background = "#ff6347")
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.4951796 | 0.2988606 | 0.1705067 | 2.522321 | 0.053852 | 7 | 13.88151 | -9.763028 | 1.55984 | 0.5233058 | 18 | 26 |
The regression analysis produced two significant interaction terms.
First, the relationship between daily Intensive Care Unit (ICU) cases and average sentiment appears to differ depending on the number of daily confirmed cases.
Second, the relationship between daily ICU cases and average sentiment appears to differ depending on the number of daily deaths.
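Concretely, the marginal effect of Daily.Deaths is a linear function of ICU cases. Using the model2 coefficients reported above, its sign flips inside the observed ICU range (41 to 75), which is what makes the interaction interpretable:

```r
# Marginal effect of Daily.Deaths at a given ICU load, from the model2
# coefficients printed above:
b_deaths      <- -0.30014641   # Daily.Deaths main effect
b_interaction <-  0.00460127   # Daily.Deaths:Intensive.Care.Unit..ICU.

slope_deaths <- function(icu) b_deaths + b_interaction * icu

# The slope crosses zero at roughly 65 ICU cases: below this ICU load,
# more deaths predict lower sentiment; above it, the sign reverses.
crossover <- -b_deaths / b_interaction
round(crossover, 1)
```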
To visualize the regression analysis performed above, we stored the model's predictions.
Daily deaths is a discrete count variable in our model. To ease interpretation, we categorized it into three levels: one SD below the mean (M-1SD), the mean (M), and one SD above the mean (M+1SD). The following figure plots three lines for Daily Deaths set at these levels, showing how differing Daily Deaths change the relationship between ICU cases and average sentiment.
#Daily death
merged_dataset %>% summarise(sd(Daily.Deaths), mean(Daily.Deaths)) %>%
mutate(`sd(Daily.Deaths)` = round(`sd(Daily.Deaths)`, digits = 0),
`mean(Daily.Deaths)` = round(`mean(Daily.Deaths)`, digits = 0))
| sd(Daily.Deaths) | mean(Daily.Deaths) |
|---|---|
| 3 | 12 |
grid_group3 <- merged_dataset %>%
data_grid(Intensive.Care.Unit..ICU. , Daily.Deaths = c(9, 12, 15), # c(-1.26, 0.00, 1.26),
Still.Hospitalised =0, Daily.Confirmed = 0) %>%
add_predictions(model2)
grid <- grid_group3 %>%
mutate(Intensive.Care.Unit..ICU. = Intensive.Care.Unit..ICU. + mean(merged_dataset$Intensive.Care.Unit..ICU.), Daily.Deaths = factor(as.double(factor(Daily.Deaths))))
ggplot(grid, aes(x = Intensive.Care.Unit..ICU., y = pred, color = factor(Daily.Deaths))) +
geom_line(size = 1) +
scale_color_discrete(breaks = c(1, 2, 3),
label=c("Low in Death",
"Mean Death",
"High in Death")) +
labs(x = "Intensive Care Unit",
y = "Average Sentiment",
color = "Daily Death") +
coord_cartesian(ylim = c(-4, -0.8), xlim = c(100, 138)) +
theme_linedraw() +
theme(legend.position = "top")
The following figure plots three lines for Daily Confirmed cases set at M-1SD, M and M+1SD, showing how differing daily case counts change the relationship between ICU cases and average sentiment.
#Daily confirm
merged_dataset %>% summarise(sd(Daily.Confirmed), mean(Daily.Confirmed)) %>%
mutate(`sd(Daily.Confirmed)` = round(`sd(Daily.Confirmed)`, digits = 0),
`mean(Daily.Confirmed)` = format(`mean(Daily.Confirmed)`, digits = 4, scientific = FALSE))
| sd(Daily.Confirmed) | mean(Daily.Confirmed) |
|---|---|
| 640 | 3206 |
grid_group4 <- merged_dataset %>%
data_grid(Intensive.Care.Unit..ICU. , Daily.Confirmed = c(2566, 3206, 3846),
Still.Hospitalised =0, Daily.Deaths = 0) %>%
add_predictions(model2)
grid1 <- grid_group4 %>%
mutate(Intensive.Care.Unit..ICU. = Intensive.Care.Unit..ICU. + mean(merged_dataset$Intensive.Care.Unit..ICU.), Daily.Confirmed = factor(as.double(factor(Daily.Confirmed))))
ggplot(grid1, aes(x = Intensive.Care.Unit..ICU., y = pred, color = factor(Daily.Confirmed))) +
geom_line(size = 0.7) +
scale_color_discrete(breaks = c(1,2,3),
label=c("Low in Daily Confirmed",
"Mean Daily Confirmed",
"High in Daily Confirmed")) +
labs(x = "Intensive Care Unit",
y = "Average Sentiment",
color = "Daily Confirmed") +
coord_cartesian(ylim = c(-3.5, -1.5), xlim = c(100, 140)) +
theme_linedraw() +
theme(legend.position = "top")
pacman::p_load(jtools, huxtable, ggstance, interactions)
m1 <- lm(average_sentiment~ Daily.Deaths + Daily.Confirmed +Still.Hospitalised + Intensive.Care.Unit..ICU.,
train.data)
m2 <- lm(average_sentiment ~ (Daily.Confirmed + Daily.Deaths +Still.Hospitalised) * Intensive.Care.Unit..ICU.,
train.data)
export_summs(m1, m2,
error_format = "(t = {statistic}, p = {p.value})",
align = "right",
model.names = c("Main Effects Only", "with Interactions"),
digits = 3)
| | Main Effects Only | with Interactions |
|---|---|---|
| (Intercept) | -1.536 | -8.410 |
| (t = -1.667, p = 0.110) | (t = -1.111, p = 0.281) | |
| Daily.Deaths | 0.001 | -0.300 * |
| (t = 0.118, p = 0.907) | (t = -2.617, p = 0.017) | |
| Daily.Confirmed | -0.000 | 0.002 * |
| (t = -1.108, p = 0.280) | (t = 2.519, p = 0.021) | |
| Still.Hospitalised | 0.001 | 0.004 |
| (t = 1.002, p = 0.328) | (t = 0.845, p = 0.409) | |
| Intensive.Care.Unit..ICU. | 0.007 | 0.100 |
| (t = 1.489, p = 0.151) | (t = 0.831, p = 0.417) | |
| Daily.Confirmed:Intensive.Care.Unit..ICU. | -0.000 * | |
| (t = -2.673, p = 0.016) | ||
| Daily.Deaths:Intensive.Care.Unit..ICU. | 0.005 * | |
| (t = 2.641, p = 0.017) | ||
| Still.Hospitalised:Intensive.Care.Unit..ICU. | -0.000 | |
| (t = -0.563, p = 0.581) | ||
| N | 26 | 26 |
| R2 | 0.201 | 0.495 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | ||
plot_summs(m1, m2,
scale = T,
plot.distributions = T,
model.names = c("Main Effects Only", "with Interactions")) +
theme(legend.position = "top")
sim_slopes(m2,
pred = Intensive.Care.Unit..ICU.,
modx = Daily.Deaths,
johnson_neyman = F)
## SIMPLE SLOPES ANALYSIS
##
## Slope of Intensive.Care.Unit..ICU. when Daily.Deaths = 8.38994 (- 1 SD):
##
## Est. S.E. t val. p
## ------- ------ -------- ------
## -0.01 0.01 -1.54 0.14
##
## Slope of Intensive.Care.Unit..ICU. when Daily.Deaths = 12.07692 (Mean):
##
## Est. S.E. t val. p
## ------ ------ -------- ------
## 0.00 0.01 0.83 0.42
##
## Slope of Intensive.Care.Unit..ICU. when Daily.Deaths = 15.76391 (+ 1 SD):
##
## Est. S.E. t val. p
## ------ ------ -------- ------
## 0.02 0.01 2.48 0.02
sim_slopes(m2,
pred = Daily.Deaths,
modx = Intensive.Care.Unit..ICU.,
johnson_neyman = T)
## JOHNSON-NEYMAN INTERVAL
##
## When Intensive.Care.Unit..ICU. is OUTSIDE the interval [57.22, 71.98], the
## slope of Daily.Deaths is p < .05.
##
## Note: The range of observed values of Intensive.Care.Unit..ICU. is [41.00,
## 75.00]
##
## SIMPLE SLOPES ANALYSIS
##
## Slope of Daily.Deaths when Intensive.Care.Unit..ICU. = 54.00306 (- 1 SD):
##
## Est. S.E. t val. p
## ------- ------ -------- ------
## -0.05 0.02 -2.30 0.03
##
## Slope of Daily.Deaths when Intensive.Care.Unit..ICU. = 62.80769 (Mean):
##
## Est. S.E. t val. p
## ------- ------ -------- ------
## -0.01 0.01 -1.02 0.32
##
## Slope of Daily.Deaths when Intensive.Care.Unit..ICU. = 71.61232 (+ 1 SD):
##
## Est. S.E. t val. p
## ------ ------ -------- ------
## 0.03 0.01 2.05 0.05
Our data science project yielded three main findings:
The first case of COVID-19 in Singapore was confirmed on 23 January 2020, and since the nationwide partial lockdown ("circuit breaker") ran from 7 April 2020 to 1 June 2020, Singapore has experienced frequent policy changes in an effort to cope more effectively with surges. As the new norm has taken root in local lifestyles, no apparent jump or trend has been observed in public sentiment. When we further analysed the effects of the most recent key announcements, a sudden increase in the number of tweets and a jump in overall sentiment suggested a causal effect, but it did not significantly push any specific emotion class into a trend. Although the announced extension of social restrictions was expected to cause negative sentiment, the Prime Minister's address overwhelmed that effect, and the sentiment that emerged showed more trust in the leadership. Thus, delivering information, creating awareness and generating engagement appears effective in producing positive public sentiment.
The relationship between daily average sentiment and daily ICU cases depends on daily confirmed cases and daily deaths. When deaths are high and ICU cases are low, average sentiment is lower; when both ICU cases and deaths are high, average sentiment is higher. When deaths are low and ICU cases are low, average sentiment is higher; when deaths are low but ICU cases are high, sentiment is lower.
For daily confirmed cases, whether they are low or high, the more ICU cases there are, the better the average sentiment; this pattern holds at both low and high daily case counts.
We approached this from two angles. First, we analysed whether major government announcements had any impact on the Twitter feed; for the two specific events we examined, we found sudden spikes in the number of tweets. Second, we analysed the correlation between daily average sentiment and the COVID case numbers (deaths, ICU, hospitalised and daily cases). As the results show, we would need a longer time frame: sentiment is largely constant whether daily cases are low or high, but a higher death rate combined with low ICU cases is associated with lower average sentiment.
Public sentiment reflected in tweet data is especially representative of younger age groups, who spend much of their time on social media and readily tweet what is on their minds; this captures only a small sample of the whole local population. If possible, we should also filter out news-publisher tweets and focus on individual tweets. In this analysis we compared sentiment change within the group based on causal inference driven by nationwide speeches and announcements. To understand how COVID-19 has changed sentiment in society, we need to identify the timeline of key events and take a longer view spanning the pre-COVID and COVID eras, so a longer period of data would provide more insight into the differences. We were held back by the free Twitter developer account, which can only extract tweets up to 7 days in the past.
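One way to apply the publisher filter mentioned above, sketched on toy data (the screen names below are placeholders; a real list would be curated from the collected rtweet data):

```r
library(dplyr)

# Hypothetical list of news-publisher accounts to exclude:
news_publishers <- c("ChannelNewsAsia", "Reuters")

tweets_df <- tibble(
  screen_name = c("Reuters", "user_a", "user_b"),
  text        = c("Daily cases update", "feeling hopeful", "stay safe")
)

# Keep only tweets from individual users:
individual_tweets <- tweets_df %>%
  filter(!screen_name %in% news_publishers)
nrow(individual_tweets)   # 2 individual tweets remain
```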
Local slang words, or Singlish, are casually and frequently used in the local community, and many of them are a good gauge of emotion (for example, even a dialect vulgarity should register as anger). Enriching an emotion dictionary such as NRC with these terms would better tune it to distinguish positive from negative sentiment.
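A minimal sketch of how such an enrichment could work with the tidy join pattern used earlier; the Singlish words and emotion labels below are illustrative placeholders, not a vetted lexicon, and in practice the table would be bind_rows()-ed with get_sentiments("nrc"):

```r
library(dplyr)

# Illustrative (word, sentiment) pairs in the NRC format:
singlish_lexicon <- tibble(
  word      = c("sian", "shiok"),
  sentiment = c("sadness", "joy")
)

tokens  <- tibble(word = c("sian", "queue", "shiok"))  # toy tokenized tweet
matched <- tokens %>% inner_join(singlish_lexicon, by = "word")
matched$sentiment   # "sadness" "joy"
```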
Feature importance describes which features are relevant; it is another aspect we should explore to better understand the problem, and it can sometimes improve model accuracy through feature selection.
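For a linear model, a quick starting point is caret's varImp(), which ranks lm terms by the absolute value of their t statistic; a sketch on the built-in mtcars data as a stand-in for train.data:

```r
library(caret)   # caret is already attached earlier in the session

# One "Overall" importance score (the |t| value) per predictor:
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
varImp(fit)
```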
In conclusion, we feel that with a longer runway, perhaps a six-month period, we could yield better results and analyse a longer chain of events. Moreover, COVID has been with us for almost two years and the public has probably accepted it as the new norm, so a longer analysis period would give us more data points.
[1] Julia Silge & David Robinson, Text Mining with R: A Tidy Approach, O'Reilly (2017)
[2] Andrea Cirillo, R Data Mining: Implement data mining techniques through practical use cases and real-world datasets, Packt (2017)
[3] Tony Carilli, R Companion to Real Econometrics, February 2021, https://bookdown.org/carillitony/bailey/chp11.html
[4] Ashwin Malshe, Data Analytics Applications (2019-06-25), https://ashgreat.github.io/analyticsAppBook/collect-tweets.html
[5] Hui Xiang Chua, Covid-19 Singapore, https://data.world/hxchua/covid-19-singapore
[6] Singapore Public Data, COVID-19 case numbers, https://data.gov.sg/dataset/covid-19-case-numbers
Custom stop-words used for text pre-processing in the word cloud and data overview
"singapore", "covid", "covid19", "positive", "negative", "oct", "nov", "news", "amp", "reuters", "news", "daily", "malaysia", "november", "october", "october", "press", "journal", "amid", "weekly", "days", "weeks", "china", "chinese", "report", "am", "pm", "dont", "taking", "found", "morning", "bloomberg", "months", "month", "india", "media", "week", "read", "reports", "data", "europe", "monday", "tuesday", "wednesday", "thursday", "friday", "satursday", "sunday", "wall", "street"
The objective is to remove tokens that are less relevant and carry very little meaning for sentiment, such as punctuation, special characters, number-prefixed tokens, hashtags, aliases, links and the custom terms above.
We removed duplicated tweet text sent from the same screen name multiple times; for instance, several news publishers posted the same tweet on different days.
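A hedged sketch of this cleaning step on toy data; the regex patterns are illustrative, not the exact ones used in our pipeline:

```r
library(dplyr)
library(stringr)

raw <- tibble(
  screen_name = c("news_sg", "news_sg"),   # same publisher, duplicated tweet
  text        = rep("Cases up! https://t.co/abc #covid19 @MOHsg", 2)
)

clean <- raw %>%
  distinct(screen_name, text, .keep_all = TRUE) %>%          # drop duplicates
  mutate(text = str_remove_all(text,                         # links, aliases,
                               "https?://\\S+|[@#]\\w+|[[:punct:]]|\\d+"),
         text = str_squish(text))                            # tidy whitespace
clean$text   # "Cases up"
```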
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] interactions_1.1.5 ggstance_0.3.5 jtools_2.1.4 kableExtra_1.3.4
## [5] knitr_1.33 car_3.0-11 carData_3.0-4 ggfortify_0.4.12
## [9] ggthemes_4.2.4 gvlma_1.0.0.3 DT_0.19 caret_6.0-90
## [13] glue_1.4.2 rtweet_0.7.0.9000 scales_1.1.1 sandwich_3.0-1
## [17] Hmisc_4.5-0 Formula_1.2-4 survival_3.2-13 lattice_0.20-44
## [21] psych_2.1.9 gridExtra_2.3 huxtable_5.4.0 textdata_0.4.1
## [25] reshape2_1.4.4 wordcloud_2.6 RColorBrewer_1.1-2 wordcloud2_0.2.1
## [29] tidytext_0.3.1 lubridate_1.7.10 modelr_0.1.8 broom_0.7.9
## [33] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.7 purrr_0.3.4
## [37] readr_2.0.1 tidyr_1.1.4 tibble_3.1.5 ggplot2_3.3.5
## [41] tidyverse_1.3.1
##
## loaded via a namespace (and not attached):
## [1] readxl_1.3.1 backports_1.2.1 systemfonts_1.0.2
## [4] plyr_1.8.6 splines_4.1.0 listenv_0.8.0
## [7] SnowballC_0.7.0 digest_0.6.28 foreach_1.5.1
## [10] htmltools_0.5.2 fansi_0.5.0 magrittr_2.0.1
## [13] checkmate_2.0.0 cluster_2.1.2 openxlsx_4.2.4
## [16] tzdb_0.1.2 recipes_0.1.17 globals_0.14.0
## [19] gower_0.2.2 svglite_2.0.0 jpeg_0.1-9
## [22] colorspace_2.0-2 rvest_1.0.1 rappdirs_0.3.3
## [25] haven_2.4.3 xfun_0.25 crayon_1.4.2
## [28] jsonlite_1.7.2 zoo_1.8-9 iterators_1.0.13
## [31] gtable_0.3.0 ipred_0.9-12 webshot_0.5.2
## [34] future.apply_1.8.1 abind_1.4-5 DBI_1.1.1
## [37] Rcpp_1.0.7 viridisLite_0.4.0 htmlTable_2.2.1
## [40] tmvnsim_1.0-2 foreign_0.8-81 radarchart_0.3.1
## [43] stats4_4.1.0 lava_1.6.10 prodlim_2019.11.13
## [46] htmlwidgets_1.5.4 httr_1.4.2 ellipsis_0.3.2
## [49] pkgconfig_2.0.3 farver_2.1.0 nnet_7.3-16
## [52] sass_0.4.0 dbplyr_2.1.1 utf8_1.2.2
## [55] tidyselect_1.1.1 labeling_0.4.2 rlang_0.4.12
## [58] munsell_0.5.0 cellranger_1.1.0 tools_4.1.0
## [61] cli_3.1.0 generics_0.1.1 pacman_0.5.1
## [64] evaluate_0.14 fastmap_1.1.0 yaml_2.2.1
## [67] ModelMetrics_1.2.2.2 fs_1.5.0 pander_0.6.4
## [70] zip_2.2.0 future_1.23.0 nlme_3.1-152
## [73] xml2_1.3.2 tokenizers_0.2.1 compiler_4.1.0
## [76] rstudioapi_0.13 curl_4.3.2 png_0.1-7
## [79] reprex_2.0.1 bslib_0.2.5.1 stringi_1.7.4
## [82] highr_0.9 Matrix_1.3-4 commonmark_1.7
## [85] vctrs_0.3.8 pillar_1.6.4 lifecycle_1.0.1
## [88] jquerylib_0.1.4 data.table_1.14.2 R6_2.5.1
## [91] latticeExtra_0.6-29 rio_0.5.27 janeaustenr_0.1.5
## [94] parallelly_1.29.0 codetools_0.2-18 MASS_7.3-54
## [97] assertthat_0.2.1 withr_2.4.2 mnormt_2.0.2
## [100] broom.mixed_0.2.7 mgcv_1.8-36 parallel_4.1.0
## [103] hms_1.1.1 grid_4.1.0 rpart_4.1-15
## [106] timeDate_3043.102 class_7.3-19 rmarkdown_2.10
## [109] pROC_1.18.0 base64enc_0.1-3
There are two main segments to this project: COVID cases and Twitter mining. Both of us were involved in both parts; in fact, we met twice a week over the last month to explain our work and difficulties and to review outcomes. Yun Xun's main focus was Twitter mining and analysing sentiment via regression discontinuity around special occasions, whereas Alex focused more on the COVID cases and modeling. That said, as mentioned above, our efforts overlapped, as we worked seamlessly together to gel all the pieces into a complete project.